DOCUMENT RESUME

ED 137 318    95    OOS 930

AUTHOR          Barker, Pierce; Pelavin, Sol H.
TITLE           Issues of Reliability and Directional Bias in
                Standardized Achievement Tests: The Case of MAT70.
                P-5685.
SPONS AGENCY    National Inst. of Education (DHEW), Washington
PUB DATE        Jul 76
NOTE            31p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (60th, San
                Francisco, California, April 19-23, 1976)
AVAILABLE FROM  The Rand Corporation, 1700 Main St., Santa Monica,
                California 90406 ($3.00)

EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Achievement Gains; *Achievement Tests; Educational
                Disadvantagement; Elementary Education; Grade
                Equivalent Scores; Program Evaluation; Raw Scores;
                Scores; Standard Error of Measurement; *Standardized
                Tests; Statistical Analysis; Test Bias; *Testing
                Problems; Test Interpretation; *Test Reliability;
                Test Validity
IDENTIFIERS     Educational Voucher Demonstration; *Metropolitan
                Achievement Tests; Out of Level Testing; Standard
                Scores



ABSTRACT

This study was mounted to assess the validity of
standard score transformations of raw test scores and test bias on
the 1970 edition of the Metropolitan Achievement Test Battery, in the
context of a controversial federally funded compensatory education
program, the Educational Voucher Demonstration (EVD). On an
individual level the validity of the Standard Score scale has not
been demonstrated. Moreover, substantial bias was found in aggregate
measures (means and medians) between adjacent difficulty levels. For
these reasons, the authors could not conclude with any confidence
that the instrument herein assessed could provide dependable bases
either for individual student assessment, or program evaluation as
usually performed. (MV)



Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.






ISSUES OF RELIABILITY AND DIRECTIONAL BIAS IN STANDARDIZED
ACHIEVEMENT TESTS: THE CASE OF MAT 70

Pierce Barker

Sol H. Pelavin



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE
OF EDUCATION POSITION OR POLICY



July 1976 




The Rand Paper Series 

Papers are issued by The Rand Corporation as a service to its professional staff.
Their purpose is to facilitate the exchange of ideas among those who share the
author's research interests; Papers are not reports prepared in fulfillment of
Rand's contracts or grants. Views expressed in a Paper are the author's own, and
are not necessarily shared by Rand or its research sponsors.

The Rand Corporation 

Santa Monica, California 90406 






ISSUES OF RELIABILITY AND DIRECTIONAL BIAS
IN STANDARDIZED ACHIEVEMENT TESTS:
THE CASE OF MAT 70



Pierce Barker 
Harvard University 



Sol H. Pelavin
Stanford Research Institute

Presented at the Annual Meeting of the
American Educational Research Association, 1976



INTRODUCTION 

Standardized achievement tests (SAT's) for use in elementary and, to
a perhaps lesser extent, secondary school grades, must perforce be designed
to accommodate a very large degree of heterogeneity of actual performance
level among students, both cross-sectionally and longitudinally.

The classical solution to this problem is twofold: within a given
domain of knowledge, e.g., vocabulary, a series of tests is constructed,
each of which is thought appropriate to a restricted age/grade range, in
the sense that the 'average' student in a given range should score
somewhere near the middle of the raw score distribution. This step is thought
to be clearly necessary if the problem of excessive numbers of
uninterpretably low and high scores is to be avoided while keeping the test
length within reasonable bounds.

However, for reasons explored in detail elsewhere (Barker and Pelavin,
1975, e.g.), simple aggregations or measures of central tendency computed
on scores obtained from different test levels within a series are
clearly inappropriate. Given the necessity to compute these measures,
whether because a given nominal grade or classroom contains students
achieving at widely different levels, or because one wishes to follow
student cohorts through time, some form of transformation of the raw
scores in the series is clearly required. This generally takes the form
of what we will here call a mapping function, such that raw scores
throughout the range on each of the successive levels are mapped upon a
common baseline, which is taken to be a linear continuum of equal interval
units. The assumptions and details of the most widely used technique are
discussed in a number of standard sources (Thurstone, 1925; Angoff, 1971;
Guilford, 1954; Gulliksen, 1950, e.g.), and need not be rehearsed here.
(For a fuller discussion in the present context, see Barker and Pelavin,
1975; a more generalized discussion of the problem appears in Porter and
Chibucos, 1975.)

In general, then, the scores derived from the mapping function may
be considered the basic metric of the system of measurement comprised by
the various SAT's. For the particular SAT system upon which this research
is based, the Metropolitan Achievement Test battery, 1970 edition (Harcourt
Brace Jovanovich, Inc., 1972, 1973), this metric is called the scale of
Standard Scores (SS). Now, it should be clear that, if this metric
performs as intended, aggregation of scores derived from raw scores by
the mapping function is appropriate; in fact, it may not be too much to
argue that the intent of the mapping function is to transform the several
levels of the various domain tests into parallel tests in the SS metric
(Gulliksen, 1950).

That is, if one administers, say, two adjacent test levels within
a single domain to a sample of students under suitable conditions, it
is arguable that the correlation between transformed scores (in this
case, SS) may be interpreted as a measure of the reliability of the
tests under the assumptions underlying the theory of parallel tests.
However, it seems more appropriate (as we argue at some length in Barker
and Pelavin, 1975) to interpret such a correlation as a measure of the
goodness of fit of the obtained transformed data to the hypothesis that
they are parallel, i.e., as an investigation of the validity of that
transformation. This is the approach adopted in the present research
(see also Pelavin and Barker, forthcoming; 1976).

THE RESEARCH SETTING

Although an investigation of this sort would seem to have a certain
amount of theoretical interest, the present research was in fact motivated
by a keen concern about the validity of the transformations in the specific
context of a federally funded and controversial educational intervention
program for what we may, in admittedly crude shorthand, call educationally
disadvantaged students in San Jose, California. This program, originally
officially (and still most widely) known as the Educational Voucher
Demonstration (EVD), is exhaustively described elsewhere ([...]
et al., 1974; Weiner and Kellen, 1974), and the ethnic composition and SES
of the student population--characteristics generally considered valid
indicators of the degree of "educational disadvantage"--is described in,
among other sources, Barker (1974).






Suffice here then to say that, while accelerated gains in
measured cognitive achievement are neither direct nor immediate
theoretical goals of the system of education vouchers as originally conceived
(Jencks, et al., 1970; Barker, 1975(a)), the mandate to measure cognitive
achievement delivered to the external evaluation staff suggested strongly
that cognitive outcomes so measured might well have important policy
implications; whereupon the dependability of conclusions drawn from
these measures became an immediate and pressing issue.

Moreover, this issue was not restricted to the problem of
longitudinal analysis. Without question, in part as a result of the educational
reorganization of schools in the EVD, such that the intended primary
educational delivery units became relatively small, teacher-originated and
teacher-directed programs (called, in the EVD, minischools), which also
usually involved a high incidence of multigraded classrooms, nominal
grades, minischools and classrooms did indeed contain students whose
actual performance levels covered a wide range of test levels within
a given domain. Consequently, the propriety of aggregation and
computation of the moments of the distributions of test scores was a salient
issue even for a single testing period, e.g., for the Fall of a given year.
(A detailed report of the magnitude of what we may call the out-of-level
testing problem appears in Barker and Pelavin, 1975.)

DESCRIPTION OF THE STUDY 

Initially, we intended to mount a validation study which included
a wide range of test levels and nominal grades; however, administrative
strictures laid down by the local school district served to reduce the
scope of the present study to one nominal grade and two adjacent test
levels.

During the regular Fall achievement testing period in 1973, all
third grade students in EVD schools were given all of the subtests of
the MAT Primary I battery ([...] 2.4), as
well as all of the corresponding subtests for MAT Primary II, the "proper"
level for third grade (3.0) students at this point. In this analysis, the
Mathematics subtest is excluded, since, as a result of misunderstanding
on the part of the teachers, this subtest was omitted for a large number
of students in the sample. These subtests were administered sequentially
within subject area, with corresponding subtests from the two levels given
4-6 days apart, with the order of administration of levels within subtests
randomly counterbalanced over students in the sample.

It is clearly important that the sample size be maximized; hence, we
included the maximum number of students available within the restrictions
imposed by the District. Of the total of 801 students eligible for testing,
valid scores from both levels of corresponding subtests
were obtained for 93% or more of the sample. The details of coverage appear
in Table (1).

COMPARISON OF CHARACTERISTICS OF SAMPLE AND STANDARDIZATION DATA

Assessments of the "reliability" or, perhaps more properly,
generalizability of scores obtained on subtests are provided by the MAT
publishers in the form of estimates of coefficients of internal consistency
(usually called coefficient alpha or generalized KR20) based upon the data
gathered from the 1970 edition standardization sample; and estimates of
the standard error of measurement (SEM) reported are based upon these
reliability estimates. In order to assess the degree of comparability
of our data with the publisher's data, we estimated these coefficients
and SEM's from our own data. The results appear in Tables (2-5).

To summarize these results, we simply observe that in all cases,
local consistency coefficient estimates were greater than 0.90, and that,
while local estimates were in all cases lower than those reported by the
publisher, the modal difference is 0.01, and the greatest difference is
0.02. Estimates of the SEM are equally comparable. Given the comparative
homogeneity of the sample, differences of this order of magnitude are
surely negligible. On these bases, then, the data from our sample seem
wholly comparable to those from the publisher's standardization sample.

In addition, while published reports do not show the first two
moments of the distributions of standardization data, the comparability
of SEM estimates from our data to those provided by the MAT publisher
strongly suggests that the variances from the two samples are reasonably
comparable.
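
The relation between an internal-consistency coefficient and the SEM that
underlies these comparisons can be illustrated with a minimal Python sketch.
It assumes the conventional formulas, KR-20 = [k/(k-1)][1 - sum(pq)/variance
of total scores] and SEM = SD * sqrt(1 - reliability); the function names and
the small response matrix are hypothetical and are not data from the study.

```python
import math

def kr20(item_matrix):
    """KR-20 for dichotomous (0/1) items; rows are students, columns are items."""
    n_students = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_students
    # Variance of total scores (denominator n).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)

def sem(sd, reliability):
    """Standard error of measurement implied by a reliability estimate."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical responses (5 students x 4 items), for illustration only.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

reliability = kr20(responses)
print(round(reliability, 3))             # 0.8
print(round(sem(10.0, reliability), 3))  # SEM for a hypothetical SD of 10
```

Because the SEM depends on both the reliability and the score standard
deviation, similar SEM's under similar reliabilities point to similar
standard deviations, which is the inference about the variances drawn above.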






Table 1

NUMBER OF STUDENTS WITH SCORES
ON PRIMARY I AND PRIMARY II, BY SUBTEST

Scale             N      % Omitted (a)

Word Knowledge    780    2.62
Word Analysis     776    3.12
Reading           744    7.11

Total             801

a) For each subtest, the percentage of the
total number of students tested for whom
at least one subtest score was missing.






Table 2

RELIABILITY COEFFICIENTS

                  PRIMARY I                   PRIMARY II
SUBTEST           National (a)  Local (b)     National (a)  Local (b)

Word Knowledge    .94           .932          .95           .940
Word Analysis     .94           .916          .93           .907
Reading           .96           .948          .95           .935

a) Source: MAT Teacher's Handbook.

b) Computed as KR20's: See footnote (5).






Table 3

STANDARD ERRORS OF MEASUREMENT
FOR THREE SUBTESTS: GRADE EQUIVALENTS

                  PRIMARY I                   PRIMARY II
SUBTEST           National (1)  Local         National (2)  Local

Word Knowledge    .2            .24           .3            .20
Word Analysis     .2            .26           .3            .30
Reading           .2            .22           .3            .23

1) Fall standardization: grade = 2.1.

2) Spring standardization: grade = 2.7.



Table 4

STANDARD ERRORS OF MEASUREMENT
FOR THREE SUBTESTS: STANDARD SCORES

                  PRIMARY I                   PRIMARY II
SUBTEST           National (1)  Local         National (2)  Local

Word Knowledge    2.7           3.2           2.5           2.5
Word Analysis     2.3           2.8           2.8           3.2
Reading           2.5           2.8           2.7           3.2

1) Fall standardization: grade = 2.1.

2) Spring standardization: grade = 2.7.



Table 5

STANDARD ERRORS OF MEASUREMENT
FOR THREE SUBTESTS: RAW SCORES

                  PRIMARY I                   PRIMARY II
SUBTEST           National (1)  Local         National (2)  Local

Word Knowledge    1.7           1.8           2.0           2.5
Word Analysis     2.0           2.1           2.0           2.4
Reading           2.2           2.4           2.3           2.8

1) Fall standardization: grade = 2.1.

2) Spring standardization: grade = 2.7.



DESCRIPTION OF ANALYTIC FRAMEWORK

At various points heretofore, we have indicated a concern with the
validity or usefulness of the SS scale on both the individual and aggregate
levels. Our analytic framework, then, is designed to take account of both
of these areas; for convenience, we discuss them seriatim. At this point,
we should mention that, while much of the original analysis was reported
in a context of comparisons of various linear models, our present
discussion, because of time and space limitations, will be restricted, on
the individual level, to an assessment of the goodness of fit of
correlations obtained from the data, both observed and corrected for
putative error of measurement, to the publisher's implicit hypothesis
that, in terms of transformed scores (SS), it doesn't matter which test
level we use. Note that this paraphrase of Gulliksen's (1950) vernacular
definition of parallelism is intentional: if this implicit hypothesis
is to receive support, it should be the case that correlations between
subtests within domains, when based upon transformed scale scores (SS),
should approximate the reported reliabilities of either level; and
this, of course, implies that disattenuated correlations should approach
1.00.

Of course, one does not expect complete substitutability across all
test levels. For example, a sample of eighth grade students would all
be expected to get an essentially perfect score on a test intended for
second graders, almost without regard to their scores on a test for
eighth graders. For adjacent tests, however, or for tests two steps
apart, the tests are designed to permit substitutability, as the
publishers themselves have said.

INDIVIDUAL LEVEL ANALYSIS

Although our primary interest is in the performance of the SS, we
also included in our analyses parallel assessments of both the Grade
Equivalent (GE) scale, since it is, whether merited or not, widely used,
and the raw scores, primarily as a baseline. (Inasmuch as the
transformation procedure used to relate raw scores to SS assumes that the
various raw scores are related by a linear transformation, it may appear
that the use of raw scores does not provide a true data baseline, as
against the theoretical baseline discussed above. In the event, however,
nothing more suitable is available; we include it here as a matter
of interest.)

For each of the six tests included in our study, we report in the
following tables, for GE, SS and raw scores, estimates of the observed
correlations and correlations corrected for hypothesized error of
measurement.

Now, since we discuss in considerable detail elsewhere (Barker and
Pelavin, 1975) the propriety of using consistency coefficients as estimates
of reliability, we simply briefly outline the argument here. The simplest
and perhaps most defensible interpretation of coefficient alpha in the
context of test theory is as an index of "behavior domain validity"
(Tryon, 1957); i.e., the correlation between scores on a sample from
a domain and scores on the total domain. On this interpretation (and
derivation: cf., e.g., Kaiser and Michael, 1975), scores in the domain
are taken to be "true scores," and, perhaps more to the immediate point,
the only source of error of measurement theoretically allowed is error
arising from the fact that a domain is sampled instead of exhaustively
surveyed in any particular test.

Hence, if estimates of the SEM of a test are to be based (as those
of MAT 70 are) on estimates of alpha, the implicit claim is that sampling
error alone contributes to error of measurement; and it follows that, if
the validity of this claim is to be tested, as here, then estimates of
alpha are the correct disattenuation coefficients (Barker and Pelavin,
1975; Barker, 1975(b)).
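
As a minimal illustration of the correction described above, the following
Python sketch applies the classical correction for attenuation, dividing the
observed correlation by the square root of the product of the two
reliability estimates. The inputs are the Word Knowledge values reported in
Table 6 (Grade Equivalent units); the function and variable names are
hypothetical.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Word Knowledge, Grade Equivalent units (Table 6): observed between-level
# correlation and the local KR-20 estimates for Primary I and Primary II.
r_obs = 0.660
alpha_primary_1 = 0.932
alpha_primary_2 = 0.940

r_true = disattenuate(r_obs, alpha_primary_1, alpha_primary_2)
print(round(r_true, 3))       # 0.705, the estimated "true score" correlation
print(round(r_true ** 2, 3))  # 0.497, the estimated "true score" common variance
```

With reliabilities above 0.90 the correction is small, so an observed
correlation near 0.7 cannot be brought close to 1.00 by this adjustment.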

Moreover, as should be clear from the description of the design of
this study, the interval of time between administration of alternate
levels of any subtest is clearly too small for any measurable expected
change in actual knowledge to occur. (That is, changes in the relative
rank order of the students between test points can hardly be attributed
to differential rates of change of learning, since, on the whole, only
negligible amounts of learning can be expected to occur.)

Finally, a check of the data for order effects disclosed that no
significant effects were present; this finding was confirmed in Pelavin
and Barker (forthcoming; 1976).



A cursory examination of the results presented in Tables (6-8) is
sufficient to show that between-level correlations within subtest areas
(domains) do not approach at all closely the putative reliability of
either level, nor do the disattenuated correlations approach 1.00. Neither
do we observe any significant differences in the values of these estimates
dependent upon the scale used; in fact, the estimates based upon raw scores
and SS are literally indistinguishable. We may summarize these findings
by quoting the means over all scales and subtests: RBAR for observed
scores = .739; for disattenuated estimates, RBAR = .798.

It is also clear, then, that if these estimates from observed data
are to be taken as parallel form estimates of reliabilities, the resultant
estimates of the SEM must be far higher than those reported by the publisher
or than those computed on the basis of estimates of alpha from our own
data. Furthermore, given that within-class variances are homogeneous, a
hypothesis that cannot be rejected on the basis of our data, the errors
of estimate based upon these data (i.e., the pooled standard deviation of
residuals about the regression line predicting subtest scores on one
level from those on another) must be quite large; in fact, on the whole,
they will approximate 2/3 of the standard deviation of obtained scores.
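
The magnitudes in this paragraph can be checked with a short Python sketch.
It assumes the standard formulas SEM = SD * sqrt(1 - r) and standard error
of estimate = SD * sqrt(1 - r^2), expressed here in units of the score
standard deviation; the inputs are the mean observed correlation quoted
above and the local KR-20 estimates from Table 2. The function names are
hypothetical.

```python
import math

# Mean observed between-level correlation quoted in the text, and the
# local KR-20 estimates from Table 2.
mean_observed_r = 0.739
local_kr20 = [0.932, 0.940, 0.916, 0.907, 0.948, 0.935]
mean_kr20 = sum(local_kr20) / len(local_kr20)

def sem_in_sd_units(reliability):
    """SEM as a fraction of the score standard deviation."""
    return math.sqrt(1.0 - reliability)

def error_of_estimate_in_sd_units(r):
    """Standard deviation of regression residuals as a fraction of the SD."""
    return math.sqrt(1.0 - r ** 2)

sem_alpha = sem_in_sd_units(mean_kr20)           # alpha taken as reliability
sem_parallel = sem_in_sd_units(mean_observed_r)  # correlation taken as reliability
see = error_of_estimate_in_sd_units(mean_observed_r)

print(round(sem_alpha, 2))     # 0.27
print(round(sem_parallel, 2))  # 0.51
print(round(see, 2))           # 0.67
```

On these figures the SEM implied by treating the between-level correlation
as a parallel-form reliability is nearly twice the SEM implied by alpha, and
the error of estimate is about two thirds of a standard deviation.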

On the whole, then, we must conclude from these data that the validity
of the SS scale on the individual level has not been demonstrated. We
will return to a discussion of this finding in the penultimate section
of this paper, following the presentation of results of the aggregate
level analysis.

AGGREGATE LEVEL ANALYSIS

In educational evaluation, the following situation is not at all
uncommon: one wishes to assess the amount of cognitive gain of some
group of students over, say, a period of one year. Now, as we have
pointed out above, it often happens that testing students at both points
in time with the same test is inappropriate; this is just the situation
of longitudinal comparisons for which, as we have said, score
transformations (in this case, the SS transformation) are in part designed.
If we assume that the SS on two adjacent levels of some domain (subtest)


Table 6

CORRELATIONS BETWEEN CORRESPONDING SUBTESTS,
GRADE EQUIVALENT UNITS, CORRECTED FOR ATTENUATION

(1)               (2)    (3)        (4)          (5)        (6)         (7)      (8)        (9)
SUBTEST           N      r_{I,II}   r^2_{I,II}   r_{kk,I}   r_{kk,II}   r_{tt}   r^2_{tt}   Diff

Word Knowledge    780    .660       .436         .932       .940        .705     .497       .061
Word Analysis     776    .795       .632         .916       .907        .872     .761       .129
Reading           744    .694       .482         .948       .935        .757     .543       .016

Mean                     .716                                           .778

(3) = raw score correlation
(4) = raw score common variance
(5), (6) = K-R 20 estimated from present data
(7) = raw score correlation with attenuation correction; estimated "true score"
      correlation
(8) = estimated "true score" common variance
(9) = estimated common variance increase [(8) - (4)]



Table 7

CORRELATIONS BETWEEN CORRESPONDING SUBTESTS,
STANDARD SCORES, CORRECTED FOR ATTENUATION

(1)               (2)    (3)        (4)          (5)        (6)         (7)      (8)        (9)
SUBTEST           N      r_{I,II}   r^2_{I,II}   r_{kk,I}   r_{kk,II}   r_{tt}   r^2_{tt}   Diff

Word Knowledge    780    .718       .516         .932       .940        .767     .588       .072
Word Analysis     776    .825       .681         .916       .907        .905     .819       .138
Reading           744    .708       .501         .948       .935        .752     .566       .061

Mean                     .750                                           .808

(3) = raw score correlation
(4) = raw score common variance
(5), (6) = K-R 20 estimated from present data
(7) = raw score correlation with attenuation correction; estimated "true score"
      correlation
(8) = estimated "true score" common variance
(9) = estimated common variance increase [(8) - (4)]



Table 8

CORRELATIONS BETWEEN CORRESPONDING SUBTESTS,
RAW SCORES, CORRECTED FOR ATTENUATION

(1)               (2)    (3)        (4)          (5)        (6)         (7)      (8)        (9)
SUBTEST           N      r_{I,II}   r^2_{I,II}   r_{kk,I}   r_{kk,II}   r_{tt}   r^2_{tt}   Diff

Word Knowledge    780    .687       .472         .932       .940        .734     .539       .067
Word Analysis     776    .786       .618         .916       .907        .862     .744       .126
Reading           744    .778       .605         .948       .935        .826     .683       .078

Mean                     .750                                           .807

(3) = raw score correlation
(4) = raw score common variance
(5), (6) = K-R 20 estimated from present data
(7) = raw score correlation with attenuation correction; estimated "true score"
      correlation
(8) = estimated "true score" common variance
(9) = estimated common variance increase [(8) - (4)]






are not directionally biased, then it would seem reasonable to compare,
say, Spring with Fall scores in transformed score units, and take
the mean difference over the group of interest as an unbiased estimate
of cognitive gain in the domain.

Now note that, in substituting, as it were, one test level for
another, we subject our scores to what Gulliksen (1950) calls error of
substitution; however, this simply means that, in addition to the usual
error of measurement, we incur some error increment by comparing two
presumably independent samples of a domain with each other. It does
not, however, entail what we are here calling directional bias. The
nature of this bias is explicated in detail in Barker and Pelavin (1975);
here, we simply provide a brief illustration.

Letting I, II index different test levels (where, in the present
case, II = I + 1), and 1, 2 index different times of administration (Fall
and Spring, say), we wish to assume that

$$E(II_2 - I_1) = g, \text{ say,} \tag{1}$$

where E is the expected value operator,
and g denotes "true growth."

Now, we would assume that, in any units,

$$E(II_2 - II_1) = g; \tag{2}$$

i.e., (2) differs from (1) only in that we have substituted II for I at
time (1), a procedure which the SS (or equivalent transformation) is
designed to permit.

Therefore, writing the identity

$$(II_2 - I_1) = (II_2 - II_1) + (II_1 - I_1),$$

we have

$$E(II_2 - I_1) = E(II_2 - II_1) + E(II_1 - I_1) = g \tag{3}$$

if

$$E(II_1 - I_1) = 0. \tag{4}$$



That is, if (4) does not hold, then the comparison (1) is directionally
biased. In this case, the validity of (4) is testable from the data at
hand; and a finding of statistically significant departures of (4) from
zero would indicate that any differences found were likely to be reliable.
It would then remain to consider how practically significant such
differences might be.

Said another way, a finding that (4) does not hold would indicate
that, ceteris paribus, it does matter which level of a domain test is
used; a comparison of an aggregation of students on both levels
administered at virtually the same time (1) would show some reliable
difference in, say, the mean of their scores, attributable almost solely
to the level of the test of the domain which was administered.
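
A test of condition (4) from summary statistics can be sketched in Python as
follows. This is an illustration under stated assumptions, not the authors'
computation: it treats the per-student differences (Primary I minus
Primary II) as a single sample, uses a normal approximation for the p-value
and the 95% limits (reasonable with several hundred students), and takes as
inputs the rounded Word Knowledge values for Grade Equivalents shown in
Table 9. The function name is hypothetical.

```python
import math

def mean_difference_test(mean_diff, sd_diff, n, z_crit=1.96):
    """Test H0: E(difference) = 0 from the mean, SD and count of differences."""
    se = sd_diff / math.sqrt(n)
    t = mean_diff / se
    # Two-sided p-value from the normal approximation to the t distribution.
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return {
        "se": se,
        "t": t,
        "p": p,
        "lower": mean_diff - z_crit * se,
        "upper": mean_diff + z_crit * se,
    }

# Word Knowledge, Grade Equivalents (Table 9): mean difference, SD, and N.
result = mean_difference_test(0.082, 0.721, 780)
print(round(result["se"], 3))     # 0.026
print(round(result["lower"], 3))  # 0.031
print(round(result["upper"], 3))  # 0.133
```

Because the inputs are rounded, the t statistic computed this way differs
slightly from the tabled value; the standard error and confidence limits
agree with the table to three decimals.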

The results of aggregate comparisons for all subtests (domains) in
both GE and SS units appear in Tables (9-10).

An examination of these tables shows that, for all three GE
comparisons and two of three SS comparisons, reliable directional bias
appears (for these five, maximum p = .002). The fact that, for the
vocabulary subtest, p = .236 for SS, may be considered an argument in
favor of the SS scale as against the GE; however, the other, evidently
reliable, differences are quite large, when we consider that they
constitute means over a minimum of 744 students.

Now, the magnitudes of these differences in SS have no prima facie
interpretation, although they may be given a rough interpretation in
terms of percentiles (see Barker and Pelavin, 1975). However, the
closely related GE differences may be given a rough interpretation in
the following way. If we assume that .1 GE is the equivalent of about
one month's gain for the average student, the differences we find are
the equivalent of 10% - 20% of an expected year's growth, although
these differences appear from tests administered over a period of less
than one week.

On the other hand, these are not, in the usual parlance, average
students; for students of the sort at hand, the usual estimate of
expected growth over a school year is about 0.6 GE (cf., e.g., Fennessey,
1973). If we adopt this rough expectation, the size of the directional





Table 9

TEST OF MEAN DIFFERENCES, PRIMARY I AND PRIMARY II
GRADE EQUIVALENTS, WITHIN CORRESPONDING SUBTESTS

                  MEAN                                              95% CONFIDENCE
                  DIFFERENCE                                        LIMITS
SUBTEST           (I-II)       SD     SE     DF    t       p        UPPER    LOWER

Word Knowledge     .082        .721   .026   779   3.155   .002      .133     .031
Word Analysis     -.183        .602   .022   775   8.442   <.001    -.140    -.226
Reading            .084        .721   .026   743   3.188   .002      .135     .033



Table 10

TEST OF MEAN DIFFERENCES, PRIMARY I AND PRIMARY II
STANDARD SCORES, WITHIN CORRESPONDING SUBTESTS

                  MEAN                                                95% CONFIDENCE
                  DIFFERENCE                                          LIMITS
SUBTEST           (I-II)       SD      SE      DF    t        p       UPPER     LOWER

Word Knowledge     0.364       8.564   0.307   779    1.187   .236     0.966    -0.238
Word Analysis     -2.115       6.106   0.219   775   -9.642   .001    -1.686    -2.554
Reading            1.497       9.307   0.341   743    4.385   .001     2.165     0.829



bias here is in the range 16% - 33% of a year's growth: very large
proportions indeed, if we consider that they appear to be functions of the
system of measurement itself, not of real gains, and that effects of
educational "treatments" are not usually so large (Averch, et al., 1970;
Levin, 1970; Grain, 1973; Acland, et al., 1975; e.g.).
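
The proportions quoted for the two growth expectations can be reproduced
with simple arithmetic, sketched here in Python. The sketch assumes that the
quoted ranges rest on bias magnitudes rounded to 0.1 and 0.2 GE; that
rounding is an assumption made for illustration, since the tabled GE
differences are about .08 and .18 in absolute value. The names are
hypothetical.

```python
# Directional-bias magnitudes in GE units, rounded to one decimal place
# (an assumption about how the percentages in the text were obtained).
rounded_bias_ge = (0.1, 0.2)

def share_of_expected_growth(bias_ge, expected_yearly_growth_ge):
    """Bias as a percentage of the growth expected over one school year."""
    return 100.0 * bias_ge / expected_yearly_growth_ge

# 1.0 GE per year for the "average" student; 0.6 GE for students of the
# sort at hand, as stated in the text.
for expected in (1.0, 0.6):
    low, high = (share_of_expected_growth(b, expected) for b in rounded_bias_ge)
    print(f"expected growth {expected} GE: {low:.0f}% to {high:.0f}%")
```

Under this assumption the first case gives roughly 10% to 20% and the second
roughly 17% to 33%, in line with the ranges quoted in the text.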

That is, if proportional gains of this order of magnitude relative
to expectation were found over the course of a year for educational
interventions, and they were assumed to be unbiased (an untestable assumption,
under most circumstances--which is, of course, why we are testing it
here), depending upon the direction of the bias, the interventions in
question might be judged either rather sensational successes or fairly
disappointing.

And this, of course, is precisely how cognitive scores are likely
to be used by evaluators and/or policymakers; hence, the finding of
reliable biases of this magnitude is fairly disturbing.

Now, of course, in a true randomized experimental design, systematic
bias of this sort would not bias estimates of treatment effects, ceteris
paribus; unfortunately, most educational evaluation cannot claim even
incomparable comparison groups (for which bias of this sort could make
a difference); the paradigm of comparing observed with expected
differences is far more common; and it is for just such comparisons that
bias of the kind here discussed and illustrated is confounded with "true"
growth.

As discussed more fully in Barker and Pelavin (1975), we find no
reason to attribute these biases to floor and/or ceiling effects, and,
in any case, if we were to make such an attribution, the finding that
the differences are not always in the same direction would seem to
invalidate the attribution.

In short, of the two classes of findings presented here, those just
discussed seem to us to be potentially the more serious. If we consider
the task of the evaluator and/or policymaker to be one of binary
classification of educational treatments into what we might call go/no go
categories, then we can see that it is as true of treatments as
of persons that, if scores are to be used for classification, they must
be quite highly reliable. But this is just what we have found the scores
under investigation not to be.

DISCUSSION

Alternative explanations for the findings presented here are
exhaustively discussed in Barker and Pelavin (1975), but the most likely
conclusion may be more briefly expressed here. It is simply that, when
we realize that the measurement system which we are discussing here was
not subjected by the publisher to any known validation of the sort here
reported (granting at once that this is, as we all know, no simple thing
to do), the likeliest explanation for the results found (and replicated:
Pelavin and Barker, forthcoming; 1976) is simply [...] of the
basic metric of the system.

For example j It can be shown (e,g.. Barker, 1975Cb)) thatp given 
the assumptions underlying the estimate of the SEM by the publisher, 
dlsattenuated between^level correlations are a function of the mean 
item covarlmces within each level and between levels | that , in fact, 

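$$R(T_x, T_y) = \sqrt{\frac{\bar{c}_{xy}^{\,2}}{\bar{c}_{xx}\,\bar{c}_{yy}}} \tag{5}$$

where $T_w$ denotes disattenuated scores on test $w$,

$\bar{c}_{xy}^{\,2}$ = squared mean item covariance between levels,

and $\bar{c}_{xx}$, $\bar{c}_{yy}$ = mean item covariances within each level.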

If all of these mean covariances are not roughly equal, to the extent
that the mean between covariance is less than the geometric mean of the
mean within covariances, the between-level disattenuated correlation will
be less than 1.00. However, this would indicate, on the assumptions
underlying reported estimates of SEM, that rather different domains were
sampled: for tests of this sort, this seems generally unlikely.
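
The relation described in the preceding paragraph can be illustrated numerically. The following is a minimal sketch, not taken from the source analyses. It assumes that the disattenuated between-level correlation equals the mean between-level item covariance divided by the geometric mean of the two mean within-level item covariances, which is the form the paragraph above implies. The function name and all covariance values are hypothetical.

```python
import math


def disattenuated_between_level_r(mean_cov_between, mean_cov_within_x, mean_cov_within_y):
    """Ratio of the mean between-level item covariance to the geometric
    mean of the two mean within-level item covariances (assumed form)."""
    if mean_cov_within_x <= 0 or mean_cov_within_y <= 0:
        raise ValueError("mean within-level covariances must be positive")
    geometric_mean_within = math.sqrt(mean_cov_within_x * mean_cov_within_y)
    return mean_cov_between / geometric_mean_within


# Hypothetical case 1: all three mean covariances equal, so the ratio is 1.
r_equal = disattenuated_between_level_r(0.04, 0.04, 0.04)

# Hypothetical case 2: the between-level mean covariance falls below the
# geometric mean of the within-level values, so the ratio falls below 1.
r_lower = disattenuated_between_level_r(0.03, 0.04, 0.05)

print(round(r_equal, 3), round(r_lower, 3))
```

Under these assumptions, a value well below 1.00 would arise only if items on adjacent levels covary less with one another than items within a level do, which is the "different domains" reading discussed above.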

We should add that, while, to our knowledge, the studies reported
here and in Pelavin and Barker (forthcoming; 1976) are the only ones
extant that set out systematically and specifically to test the validity
of system metrics, they are not the only ones which suggest directional
bias (see, e.g., Ayrer and McNamara, 1973).

In short, we cannot conclude with any confidence that the instruments
herein assessed provide dependable bases either for individual
student assessment, or program evaluation as usually performed.

CONCLUSION 

At this point, given current practice in evaluation,
it seems natural to ask, What is the practical import of these findings?
In short, what are we to do?

Unhappily, we cannot, on the basis of these analyses, present any
clear answer to this question; we may, however, present some suggestions.

One of these is thought to be rather difficult (by some, impossible)
to implement. We feel that the difficulty is exaggerated, but that does
not alter the feelings of those who are responsible for evaluation. It
is simply that many more evaluations than at present be designed as
randomized true experiments, rather than the quasi- or non-experiments
that are the rule today. This would at least enable us to have a bit
more confidence that estimates of treatment effects were, in truth, the
unbiased estimates which we must usually, when assessing the treatments,
at least implicitly assume that they are.

Secondly, we would suggest that the sole dependence which we so
often find upon scores from SATs be rather radically changed. At the
very least, it would seem the better part of wisdom to administer more than
one battery of such tests, time consuming though that may be; if the
results from multiple administrations are not convergent, caution in
interpretation is of course indicated. (Note, however, that convergence
is not proof of validity of either or both sets of scores.) Even better,
it seems to us, would be the additional administration of tests specifically
designed to measure learning of just what is taught. Not that this is
easy to do, either; quite the contrary. But one must recall that SATs
are not validated in the sense in which we usually think of validation,
i.e., against an explicit criterion; rather, as Goslin (1967), among
others, points out, SATs are themselves in a very real sense taken
as criterial. But, for that very reason, given that they are designed
for extremely wide usage, they are truly criterial for few if any real
existent programs or curricula.

Despite the argument made above that floor/ceiling effects do not
account well for the present findings, it is expectable in general, and
true in this case, that the distribution of raw scores on the lower level
test is somewhat skewed relative to the upper. Since the scaling method
used to map raw scores onto a common (Standard Score) scale requires, for
validity, only that the two sets of raw scores be related by a linear
transformation, clearly the method takes no account of the third moments of
the distributions. (See also Gulliksen, 1950.) While the relative skew in
these data is not large (Barker and Pelavin, 1975), it is arguable that
failure to correct for even a small relative skew could invalidate the scale.
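
The point about third moments can be made concrete with a small sketch. It is illustrative only and is not the publisher's scaling procedure: it assumes a simple linear equating that matches mean and standard deviation, and both score lists are hypothetical. A linear transformation with positive slope leaves the standardized skewness of a distribution unchanged, so the equated lower-level scores keep their original skew even though their mean and standard deviation match the upper level exactly.

```python
import math


def moments(scores):
    """Return (mean, population SD, standardized skewness) of a score list."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    skew = sum((s - mean) ** 3 for s in scores) / (n * sd ** 3)
    return mean, sd, skew


def linear_equate(scores, target_mean, target_sd):
    """Linearly rescale scores so their mean and SD match the targets."""
    mean, sd, _ = moments(scores)
    return [target_mean + target_sd * (s - mean) / sd for s in scores]


# Hypothetical raw scores: the lower level piles up near its ceiling
# (negative skew); the upper level is roughly symmetric.
lower = [28, 30, 31, 33, 34, 35, 36, 36, 37, 38]
upper = [12, 15, 17, 19, 20, 21, 23, 25, 28, 30]

up_mean, up_sd, up_skew = moments(upper)
_, _, low_skew = moments(lower)

equated = linear_equate(lower, up_mean, up_sd)
eq_mean, eq_sd, eq_skew = moments(equated)

print("means match:", math.isclose(eq_mean, up_mean))
print("SDs match:", math.isclose(eq_sd, up_sd))
print("skew carried over from lower level:", math.isclose(eq_skew, low_skew, abs_tol=1e-9))
print("skew differs from upper level:", abs(eq_skew - up_skew) > 0.1)
```

In this hypothetical case the first two moments agree after equating while the third does not, which is the situation the paragraph above describes.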

Two remedies suggest themselves for this situation, apart from
developing a method which does take account of third moments. One of these
amounts to decreasing the number of levels of the test, while including a
certain amount of overlap; indeed, there is informal evidence that this
meliorates the problem. The other would involve administration of a careful
pretest, so that individual students would be administered the level on
which they would be most likely to achieve a score in the middle of the
range. Research is underway to assess the usefulness of this stratagem;
however, data presented in detail in Barker and Pelavin (1975), comparing
between-level score differences for students grouped into quartiles on one
level, suggest that rather large differences remain even for students
relatively near the center of the distributions on both levels.

Again, if dependence is to be placed upon test scores, for a program
of any scope or importance, it might well be necessary, if SATs are to be
used, for the evaluators to undertake extensive and rigorous metric
validation and, if required, reconstruction prior to beginning the evaluation.
Now, this is no doubt a difficult and expensive undertaking, but neither
so difficult nor so expensive as developing and fielding the programs
which are to be evaluated. If this greater sum is not to be placed at
hazard by relatively unreliable and invalid assessment criteria (an
unthinkable, but nonetheless widespread phenomenon), one can only, we
believe, conclude that these difficulties and expenses must be conquered
and paid. We have tried to set out some of the ways in which this might
be done.


NOTES 



1. This research was undertaken while the first author was associated with
The Rand Corporation, and was supported by the National Institute of
Education. The authors would like to express their deep gratitude for
the invaluable advice and assistance of Dr. T. S. Donaldson and Carol
N. Frost, both then at The Rand Corporation; to Professor Ward Keesling,
UCLA; and Mr. David R. Mandel, NIE.

2. We should make it very clear that this research is not intended
specifically to criticize the MAT; on the contrary, the MAT was chosen
precisely because it has been found (Hoepfner, et al., 1970) to be
exemplary of the genre. It is our belief that the findings reported
here are probably applicable to most, if not all, of the SATs in
wide use.

3. It seems likely that this state of affairs is in no small part a result
of the seemingly irresistible pressure upon both sponsors and evaluators
to attempt to measure cognitive growth or status even when that is not
the sole or even primary aim of the program. This, in turn, is probably
because, in fact, it is widely felt that cognitive outcomes are important
(which is probably true) and that they, among the range of possible
outcomes, are unusually easy to measure (which is extremely doubtful).

4. However, following the outcome of the study herein reported, the issue
was deemed of sufficient importance to mount a much wider study, which
was in part a replication of this one. These results are reported in
Pelavin and Barker (1975; 1976); and they do, in fact, support the
results and conclusions reported here.

5. In fact, the publisher's estimates are based upon one of the Saupe
(1961) estimates of KR20; however, the differences observed in practice
between the estimates based upon the two procedures are entirely
negligible.

6. For a detailed discussion of this matter, see Barker and Pelavin (1975);
and Barker (1975(b)).


7. That is, if we let

$V$ = variance of obtained scores,

$V_r$ = variance of residuals,

then $V_r = V\,(1 - r_{12}^{2})$,

or $SD_r = \sqrt{V}\,\sqrt{1 - .546} = .674\,\sqrt{V}$.
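
The arithmetic in this note can be checked directly. The short sketch below is illustrative only; the function name is hypothetical, and the value .546 is the squared correlation quoted in the note.

```python
import math


def residual_sd_ratio(r_squared):
    """Ratio of the residual SD to the obtained-score SD, sqrt(1 - r^2)."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.sqrt(1.0 - r_squared)


ratio = residual_sd_ratio(0.546)
print(round(ratio, 3))  # 0.674
```

That is, with a squared correlation of .546, the standard deviation of the residuals is still about two-thirds of the standard deviation of the obtained scores.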




REFERENCES



Acland, H., Barker, P., Crain, R. L., Pelavin, S. H., and Sitgreaves, R.,
Investigation of the Impact of the Emergency School Assistance Program
on Black, Male, 10th Grade Achievement, Santa Monica: The Rand
Corporation, 1975, unpublished.

Angoff, W. H., Scales, norms, and equivalent scores. In R. L. Thorndike
(Ed.), Educational Measurement, Second Edition, Washington: American
Council on Education, 1971.

Averch, H. A., Carroll, J., Donaldson, T. S., Kiesling, H. J., and
Pincus, J., How Effective is Schooling? Englewood Cliffs, New Jersey:
Educational Technology Publications, 1974.

Ayrer, J. E. and McNamara, T. C., Survey testing on an out of level
basis, Journal of Educational Measurement, 1973, 10, pp. 79-83.

Barker, P., Analytic Approaches to Longitudinal Race and Class Indicators,
Santa Monica: The Rand Corporation, 1974, unpublished.

Barker, P., Issues in Measuring Student Cognitive Outcomes in Education
Interventions: The Case of Alum Rock, Santa Monica: The Rand
Corporation, 1975(a), unpublished.

Barker, P., Test reliability and the correction for attenuation, 1975(b),
(in review).

Barker, P. and Pelavin, S. H., Concerning Scores and Scale Transformations
in Standardized Achievement Tests, Their Accuracy and Dependability for
Individual and Aggregation: The Case of MAT 70, Santa Monica: The
Rand Corporation, 1975, unpublished.

Crain, R. L., Southern Schools, Volumes I and II, Chicago: NORC, 1973.

Fennessey, J., Using Achievement Growth to Analyze Educational Programs,
Baltimore: Johns Hopkins University, Center for Social Organization
of Schools, 1973.

Goslin, D. A., Criticism of Standardized Tests and Testing, Princeton:
CEEB/ETS, 1967.

Guilford, J. P., Psychometric Methods, Second Edition, New York:
McGraw-Hill, 1954.

Gulliksen, H., Theory of Mental Tests, New York: Wiley, 1950.

Harcourt Brace Jovanovich, Inc., Development of the Standard Score System
for the 1970 Edition of MAT, New York: Author, 1972.






Harcourt Brace Jovanovich, Inc., Development and Use of the Grade Equivalent
Scale, New York: Author, 1973.

Hoepfner, R., et al., CSE Elementary School Test Evaluations, Los Angeles:
UCLA, Center for the Study of Evaluation, 1970.

Jencks, C. and staff, Education Vouchers: A Report on Financing Elementary
Education by Grants to Parents, Cambridge: Center for the Study of
Public Policy, 1970.

Kaiser, H. F. and Michael, W. B., Domain validity and generalizability,
Educational and Psychological Measurement, 1975, 35, pp. 31-36.

Levin, H., A new model of school effectiveness, in U.S. Department of
HEW, Do Teachers Make a Difference? Washington: USGPO, 1970.

Pelavin, S. and Barker, P., An Investigation of the Generalizability of
Scaled Scores in MAT 70, Santa Monica: The Rand Corporation,
forthcoming.

Pelavin, S. and Barker, P., A study of the generalizability of standardized
achievement tests, presented at the Annual Meeting of the American
Educational Research Association, 1976. Santa Monica: The Rand
Corporation, P-5678.

Porter, A. C. and Chibucos, R. R., Common problems of design and analysis
in evaluative research, Sociological Methods and Research, 1975, 3,
pp. 235-257.

Saupe, J. L., Some useful estimates of the Kuder-Richardson formula number
20 reliability coefficient, Educational and Psychological Measurement,
1961, 21, pp. 63-71.

Thurstone, L. L., A method of scaling psychological and educational tests,
Journal of Educational Psychology, 1925, 16, pp. 433-451.

Tryon, R. C., Reliability and behavior domain validity: Reformulation
and historical critique, Psychological Bulletin, 1957, 54, pp. 229-249.

Weiler, D., et al., A Public School Voucher Demonstration: The First Year
at Alum Rock, Santa Monica: The Rand Corporation, 1974, R-1495-NIE.

Weiner, S. S. and Kellen, K., The Politics and Administration of the
Voucher Demonstration in Alum Rock: The First Year, Santa Monica:
The Rand Corporation, 1974, unpublished.






